UPSTREAM PR #18994: server : support preserving reasoning_content in assistant message #992
Conversation
Performance Review Report: llama.cpp Version Comparison

Impact Classification: Minor Impact
Total Functions Analyzed: 12 functions across 2 binaries (llama-tts, llama-cvector-generator)

Executive Summary
Performance analysis reveals no meaningful impact on LLM inference. All 12 functions are non-critical utility operations (logging, templates, argument parsing) that execute during initialization, not in inference loops. Changes stem primarily from compiler optimization differences, with one intentional maintainability refactoring.

Key Findings
Commit Context (3 commits by Xuan Son Nguyen):

Most Impacted Function:

STL Functions (9 functions):

JSON Library (1 function):

Performance-Critical Assessment
None of the 12 functions are performance-critical. llama.cpp's bottlenecks are:

All analyzed functions execute outside these hot paths.

Power Consumption & GPU Impact
Power Impact: Negligible (<0.01% of total system power)

Conclusion
The performance changes are appropriate and justified. The jinja refactoring prioritizes maintainability over micro-optimization in non-critical code. Compiler-driven STL optimizations show expected variance between builds. No optimization action required.
Force-pushed from 095e526 to db6cb7a
Performance Review Report: llama.cpp Reasoning Content Support

Executive Summary
Analysis of 15 functions across the llama-tts and llama-cvector-generator binaries reveals moderate performance impact from architectural changes supporting reasoning-focused LLMs (DeepSeek R1, OpenAI o1). The largest regression (+8,268 ns) occurs in debug-only code, while core inference paths remain unaffected.

Commit Context
Five commits by Xuan Son Nguyen implement reasoning content support:

Files changed: 10 modified, 37 added, 3 deleted

Performance Impact Analysis

Most-Impacted Functions
1. jinja::caps::to_string() (both binaries)
2. std::vector<>::begin() (2 instances)
3. httplib::Request::Request() (cvector-generator)

Performance Improvements
std::vector operations (6 functions): 28-57% response time improvements (-64 to -203 ns) from compiler optimizations and template consolidation in chat message handling.

Code Changes Assessment
The refactoring prioritizes maintainability over micro-optimization in non-critical paths:

All changes are justified - regressions occur exclusively in initialization and debug code, while actual execution paths show improvements or no change.

Power Consumption
Net execution time increase: +16,446 ns per cycle across affected functions. However, power impact is negligible (<0.01% of total application consumption) because:

GPU/ML Operations
No impact - zero analyzed functions involve GPU operations, tensor computations, or inference kernels. Changes affect only CPU-side template processing and initialization.

Conclusion
Performance changes are acceptable and justified. The 2,090% regression in …
Force-pushed from 984fada to 54e0744
@loci-dev how many functions were analyzed?
According to the performance review report, 15 functions were analyzed across the llama-tts and llama-cvector-generator binaries. These 15 functions include:
The analysis focused on functions impacted by the architectural changes supporting reasoning-focused LLMs, with changes ranging from significant regressions in debug-only code to improvements in vector operations.
Force-pushed from 0e2fcc8 to 5668a6a
Force-pushed from 048ad94 to 6c1fde6
Force-pushed from 823244c to bab7d39
Force-pushed from 10f8f26 to a6ecec6
Mirrored from ggml-org/llama.cpp#18994
Ref: ggml-org/llama.cpp#18936 (comment)
Changes included in this PR
- Use `json_fwd` in `chat.h` to avoid using the `template` trick
- `common_chat_msgs_to_json_oaicompat` and `common_chat_msg::to_json_oaicompat()` (see the sketch below)
- Default to `clear_thinking = false` for GLM 4.7 if it is not specified
- Add `supports_preserve_reasoning` to server `/props` (Web UI support is TBD)
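The serialization helpers named above are C++ functions in llama.cpp's chat code; the sketch below is only a rough Python illustration of what converting an assistant message back to OAI-compatible JSON with its reasoning preserved amounts to. The field names role, content, and reasoning_content come from the API example further down; the class and function names here are invented for illustration and are not the project's actual code.

```python
# Rough conceptual sketch only; names and structure are assumed, not llama.cpp's API.
from dataclasses import dataclass
from typing import Optional

@dataclass
class ChatMsg:
    role: str
    content: str
    reasoning_content: Optional[str] = None  # thinking text from a previous assistant turn

def msg_to_json_oaicompat(msg: ChatMsg) -> dict:
    """Serialize one message into the OAI-compatible shape the endpoint accepts."""
    out = {"role": msg.role, "content": msg.content}
    # Keep the reasoning alongside the visible answer so the server can re-insert
    # it into the prompt when the chat template supports that.
    if msg.role == "assistant" and msg.reasoning_content:
        out["reasoning_content"] = msg.reasoning_content
    return out

history = [
    ChatMsg("user", "Hello, world!"),
    ChatMsg("assistant", "Hey there!", reasoning_content="This is my reasoning."),
    ChatMsg("user", "Hello, world!"),
]
payload = {"messages": [msg_to_json_oaicompat(m) for m in history],
           "stream": False, "max_tokens": 64}
```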
Changes in API
The /chat/completions API now accepts reasoning_content for assistant messages:

```json
{
  "messages": [
    { "content": "Hello, world!", "role": "user" },
    { "content": "Hey there!", "role": "assistant", "reasoning_content": "This is my reasoning." },
    { "content": "Hello, world!", "role": "user" }
  ],
  "stream": false,
  "max_tokens": 64
}
```

If the template supports it, the reasoning will be put back into the message (testing with GLM 4.7):
```
[gMASK]<sop><|user|>Hello, world!<|assistant|><think>This is my reasoning.</think>Hey there!<|user|>Hello, world!<|assistant|><think>
```

Otherwise, it will be ignored.
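For completeness, a minimal client-side sketch that posts the payload above using only the Python standard library. The server URL and port are assumptions (a locally running llama-server on its default port):

```python
# Minimal sketch: post the example payload to a locally running llama-server.
# URL/port are assumptions; adjust to your setup.
import json
import urllib.request

payload = {
    "messages": [
        {"role": "user", "content": "Hello, world!"},
        {"role": "assistant", "content": "Hey there!",
         "reasoning_content": "This is my reasoning."},
        {"role": "user", "content": "Hello, world!"},
    ],
    "stream": False,
    "max_tokens": 64,
}

req = urllib.request.Request(
    "http://localhost:8080/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp))
```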
To know if the template supports it, the /props endpoint will indicate:

```json
{
  "chat_template_caps": {
    ...
    "supports_preserve_reasoning": true,
    ...
  }
}
```
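A client can therefore check the flag up front. The sketch below (same assumed local server as above) reads the capability and drops reasoning_content from the history when the loaded template cannot re-insert it; stripping is purely an optional client-side choice, since the server would ignore the field anyway:

```python
# Sketch: query /props and keep reasoning_content only when the chat template
# can actually re-insert it. Server URL/port are assumptions.
import json
import urllib.request

def template_preserves_reasoning(base_url: str = "http://localhost:8080") -> bool:
    with urllib.request.urlopen(f"{base_url}/props") as resp:
        props = json.load(resp)
    caps = props.get("chat_template_caps", {})
    return bool(caps.get("supports_preserve_reasoning", False))

def prepare_messages(history: list[dict]) -> list[dict]:
    if template_preserves_reasoning():
        return history
    # Template can't use it; strip the field to keep requests smaller.
    return [{k: v for k, v in m.items() if k != "reasoning_content"} for m in history]
```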